SUMMARY


STARSMART is a highly structured, optimized and efficient 
infrared line-by-line radiative transfer program designed for 
the Langley Research Center CDC STAR-100 computer. A vector 
oriented PASCAL derivative language, SL/1, is used to reduce 
core requirements, development effort and run time. The half 
word arithmetic feature, virtual memory and vector processing 
of the STAR-100 are employed to reduce execution time by a 
factor of 99 over an earlier serial code performing identical 
tasks. The radiative transfer formulation is analyzed in depth 
to produce optimum vectorized code, data structure and efficient 
page management schemes. In addition, the STAR-100 halfword
arithmetic and input data characteristics are evaluated for condi- 
tions leading to incorrect results or loss of significance. 

The computational speed and storage requirements of STARSMART 
are shown relative to 2 earlier vector and serial codes. Benchmarks 
are performed with simple and elaborate atmospheric models. The 
largest test case assumes a midlatitude summer atmosphere with 
4 gases and 15 layers. The bandpass extends from 2075 to 2215 cm-1 
and contains 4666 spectral lines with 14000 integration points. The 
required 979,860,000 calculations are performed in 120.88 CPU seconds. 

SECTION 1 - INTRODUCTION


A line-by-line radiative transfer program (SMART) (Ref. 2) 
has been running on the conventional serial machines (CDC Series 
6000 and CYBER 170's) at NASA Langley Research Center since the 
early seventies. This code calculates the infrared spectral 
flux at points within a given frequency interval, propagating 
through a non-homogeneous atmosphere. SMART is primarily used to 
simulate the instrument response of gas filter correlation radio- 
meters developed to monitor air pollution. 

Studies requiring large bandpasses, high spectral resolution, 
multiple gases and many atmospheric layers are costly to run. 
Furthermore, the storage limitations imposed by the core size 
of the serial machines have made the more finely tuned cases 
impossible to run without incurring significant additional costs 
of design and checkout of less efficient overlaid schemes. Such 
segmented approaches are of limited use because upcoming releases 
of the FORTRAN compiler are expected to delete the current overlay 
capability. 

In principle, virtual memory allows the software designer to 
handle large data structures with no concern for physical memory 
size; and the improved computational efficiency of the vector 
oriented machine should provide the optimum capability to run large 
models at moderate cost. 

Therefore an effort was undertaken to adapt and optimize
the SMART serial code for use on the STAR-100. The starting
point for this task was a partially completed vector code.
However, analysis of this code revealed errors in logic and
inefficient use of the vector capabilities. Because of these
problems, it was necessary to begin again from the serial
version. The approach used in this effort, the details of the
programming procedures and the results in terms of improved
efficiency are described in the following sections.

SECTION 2 - APPROACH


The radiative transfer calculations as performed by SMART 
can be viewed as a two phase process. The first phase involves 
the processing of user parameters which define the multi-layer 
atmospheric model and the calculation of wavenumber dependent 
absorption coefficients for each layer at each user specified 
integration point. In the second phase the calculation of the 
transmissivity and emissivities is performed at each integration 
point. The optical path and gas concentrations in each layer as 
well as solar and surface effects are included in the calculations. 
These values are used as inputs to parametric studies of instrument 
response to atmospheric conditions. 

Of the two phases, the first uses more computer and
peripheral time. For a virtual memory code to result in an over- 
all cost reduction for large cases, the first phase would have to 
be recast in a strongly vectorized form. The second phase would 
still be implemented on the serial machine where access to the 
more sophisticated system software allows graphic and tabular 
display. 

The adaptation and optimization of the serial code to the 
STAR-100 was broken down into the following tasks: 

(1) analysis of code for vectorization 

(2) structuring of data and development of page management 
schemes 

(3) selection of an appropriate language

(4) comparison of implicit and explicit I/O 

(5) evaluation of computer arithmetic 

(6) benchmarking and analysis of results 

SECTION 3 - LANGUAGE


The selection of an appropriate language simplifies the 
design of an efficient algorithm by satisfying the specific 
needs and structure of the task. Consideration should be given 
to the control structures, data types, diagnostics and levels 
of optimization offered by the available languages. A higher 
order language is a software tool which enables the designer 
to exploit hardware features without having to introduce unclear 
code. Good code should be easy to read, understand, maintain 
and modify. 

The STAR-100 supports three languages: 

(1) STAR FORTRAN - A vectorized, upward compatible version
of ANSI X3.9 FORTRAN 66

(2) META - The STAR-100 assembler language 

(3) SL/1 - A vector-oriented PASCAL derivative 

The first version of the vectorized code (STARSMART) was
coded in STAR FORTRAN. Its similarity to the serial NOS FORTRAN 
made it the best language to use for a quick transliteration of 
the serial code onto the STAR-100. Although STAR FORTRAN was the 
initial target language, it does not take advantage of the power- 
ful halfword arithmetic features of the STAR-100. Halfword opera- 
tions could be performed by STAR FORTRAN callable META subroutines, 
but any halfword values had to remain internal to the META code. 
Another deficiency of STAR FORTRAN is the lack of runtime diagnos- 
tic support. For example, upon abnormal termination of a STAR 
FORTRAN run, the user is left with a hard to decipher hexadecimal 
core dump. 

Exclusive programming in META was ruled out because the
time required to generate and validate assembler code and the 
inherent difficulty in documentation were judged not to be worth 
the extra speed that lower-level code could give. 

SL/1 is a LaRC developed PASCAL derivative (Ref. 7) which 
supports most of the features found in the serial version of 
the PASCAL compiler. Compound control constructs, explicit data 
typing, modularity and variable scope encourage the programmer 
to use structured techniques during the design phase. Use of 
such constructs as the DO UNTIL and IF-THEN-ELSE simplifies the 
code. The free format feature of SL/1 reduces the time spent 
keying in code, and a source text reformatting utility which
indents code increases the readability of the program. Halfword
arithmetic, accessible via the SHORT REAL data type, provides
a speedup of 2 for addition and of 4 for multiplication. The
SL/1 library contains all necessary intrinsic functions (sine,
cosine, etc.) in halfword form.

The code produced by the SL/1 compiler takes less core and
time to run. This is exhibited by the Voigt profile approximation
benchmarks (Table 1). The actual design and checkout time
spent on the SL/1 code was less than would have been spent on 
comparable FORTRAN code because of the features of the SL/1 com- 
piler. 

SL/1 is actually a cross-compiler resident on a CYBER 173. 
This allows the user to enter code on a machine with sophisticated 
text editing capabilities and to compile modules interactively.
In contrast, the FORTRAN user must pass source code over a data
link to the STAR-100 for compilation. Use of the SL/1 compiler
also reduces compilation time spent in correcting syntax errors. 

During actual validation runs, the SL/1 CHECK option was 
turned on to generate extra code at compile time to test such 
error conditions as out-of-range vector and array subscripts 
and invalid arguments at run time. STAR FORTRAN has no compar- 
able facility. In the case of an abnormal termination, SL/1 
generates a symbolic dump and as many levels of trace back as are 
necessary. This feature significantly reduces the time and effort 
spent in the detection and correction of runtime errors. 

In general, the SL/1 version of STARSMART is easier to design 
and implement because of the higher level of software support 
offered by the SL/1 compiler. The machine code generated by the 
SL/1 compiler runs faster and occupies less core. All of these 
benefits were available without losing compatibility with STAR 
FORTRAN. In fact, certain I/O routines which had to be written 
in FORTRAN during earlier releases of the SL/1 compiler were 
called by the SL/1 main module. As the SL/1 compiler was upgraded, 
these FORTRAN routines were replaced by SL/1 procedures. The 
current version of STARSMART is coded exclusively in SL/1. 

SECTION 4 - VECTORIZATION


The computational power of the STAR-100 can only be 
realized with well structured vectorized code. Maximum 
efficiency is achieved when the code which performs CPU 
intensive work is treated as a series of operations on opti- 
mal length vectors. Vectors should be long enough to over- 
ride the penalties of vector operation start-up times and 
short enough to prevent excessive disk transfers during exe- 
cution. These transfers - called page faults - occur when 
data or code must be brought in to central memory to continue 
program execution. 

The following assumptions were made to simplify the task
of vectorization:

(1) Uniform increments between integration points would 
suffice because virtual memory allows large enough vectors to 
satisfy resolution requirements. 

(2) Absorption coefficients would be kept separated with 
respect to gas as well as layer. Virtual memory would eliminate 
the questions of trade-off between storage and added flexibility. 

(3) Only atmospheric layers would be considered in the
model, instead of the combination of atmospheric, instrument and
calibration layers allowed by the serial code.

Code in the first phase of SMART containing explicit looping 
was studied for vectorization potential. Not all looped code lends 
itself to efficient use of STAR-100 vector instructions. Character- 
istics of vectorizable code such as the parallelism of repeated 
computations and independence of source and destination operands 
over a large data set were identified in the serial code. 
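
The independence criterion can be illustrated with a small sketch. Python is used here purely for illustration (STARSMART itself is written in SL/1), and both function names are hypothetical. The first loop maps directly onto a single elementwise vector operation; the second carries a dependence from one iteration to the next and so cannot be issued as one such operation.

```python
def scale_elementwise(source, factor):
    """Vectorizable: each output element depends only on the matching input."""
    return [factor * x for x in source]


def running_sum(source):
    """Loop-carried dependence: element i needs the result for element i-1."""
    out = []
    total = 0.0
    for x in source:
        total += x          # reads the value produced by the previous iteration
        out.append(total)
    return out


data = [1.0, 2.0, 3.0, 4.0]
scaled = scale_elementwise(data, 2.0)
prefix = running_sum(data)
```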

The computational logic of SMART consists of three major
loops. The outer major loop over wavenumbers within the bandpass,
an implicit DO-UNTIL implemented by a FORTRAN IF test,
was translated into an explicit loop on the STAR-100. The
uniform increment of integration, DW, between the user specified
upper and lower bounds of integration, U and L, defines the
number of integration points INT as

INT = (U-L)/DW 

The approximate length of these vectors for a typical 
case is 2000. Computations involving vectors of this length 
enter the regime of diminishing returns on the theoretical 
trade-off curves of vector-length versus time on the STAR-100. 

At a point, the cost of paging in segments of long vectors for 
arithmetic operations and building temporaries begins to degrade 
the performance of the vector operations. Therefore, these 
vectors were subdivided into blocks of 250 contiguous points 
to allow calculation of the absorption coefficients in vector 
form without excessive page faulting. 
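
The blocking scheme can be sketched as follows. This is an illustrative Python fragment rather than the SL/1 code; the function name is hypothetical, and the increment DW = 0.01 cm-1 is an assumed value chosen so that the 2075 to 2215 cm-1 bandpass yields 14000 points.

```python
def integration_blocks(lower, upper, dw, block_size=250):
    """Yield (start_index, length) for each block of contiguous points."""
    n_points = int(round((upper - lower) / dw))   # INT = (U - L) / DW
    for start in range(0, n_points, block_size):
        yield start, min(block_size, n_points - start)


# Bandpass of the largest benchmark; DW is an assumption (see lead-in).
blocks = list(integration_blocks(2075.0, 2215.0, 0.01))
n_blocks = len(blocks)
```

Under these assumptions the bandpass is processed as 56 vector blocks of 250 points each.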

Within the major loop over vector blocks of integration 
points were loops at each point over the gases, layers and 
surrounding spectral lines. These loops were too small to 
vectorize efficiently and were left in the STAR-100 code as 
explicit loops operating on the vector blocks of integration 
data. 

At the core of these loops were the calculations of the 
Voigt profiles of spectral lines and Planck radiation functions. 
The high frequency of these calculations made it imperative 
to generate a highly efficient vectorized implementation of
these functions. For example, the Voigt profile value is
calculated at each integration point for each layer at each
spectral line. The midlatitude summer model (15 layers, 4666
lines, and 14000 points) (Ref. 10) requires 979,860,000
calculations.
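
The count quoted above is the product of layers, lines and points, using the 4666 spectral lines given in the Summary. The short check below also derives an evaluation rate from the 120.88 CPU seconds reported there; the variable names are illustrative only.

```python
layers = 15
lines = 4666
points = 14000

# One Voigt profile evaluation per integration point, layer and line.
voigt_evaluations = layers * lines * points

cpu_seconds = 120.88                      # reported CPU time for this case
rate = voigt_evaluations / cpu_seconds    # evaluations per CPU second
```

The rate works out to roughly eight million profile evaluations per CPU second.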

Several candidates for an approximation to the Voigt pro- 
file were tested on the serial machines for accuracy and speed. 
Two algorithms, that of Drayson (Ref. 6) and that of Pierluissi 
(Ref. 3) were selected for adaptation to the STAR-100. 

Both codes were easily translated into SL/1 as scalar
functions. However, each would have to be called multiple
times within each block of integration points being processed,
and timing studies revealed that the serial code was faster
than the STAR-100 scalar code. Since the penalties
of scalar operations on the STAR-100 offset any gains made by
faster hardware, a vectorized approach to the calculation was
developed. The Drayson algorithm violated the criterion of
independence of source and destination operands and was rejected
as a candidate for vectorization. The Pierluissi code, which
is logically simpler, lends itself to vectorization and
was rewritten. The vectorized code proved to be 3 times faster
than the STAR-100 scalar code and comparable to the speed of the
serial code. Moreover, when vector lengths were increased the
effects of start-up time were reduced and the STAR-100 code
executed in less time than the serial code. This one modification
to code used many times in the course of a run resulted in a time
reduction of a factor of 2 over the original SMART Voigt profile
algorithm.

SECTION 5 - PAGE MANAGEMENT


Early runs made with the initial version of vectorized 
STARSMART code gave deceptively encouraging results. Cases 
dealing with bandpasses of less than 90 cm-1 in regions of sparse 
spectral lines executed on the order of 90 times faster in terms 
of raw CPU time than when run with the serial code (Table 5).
However, the submittal of a full-scale problem in a dense spectral
region (the 15 layer midlatitude summer atmosphere) resulted
in computational catastrophe. After 15 minutes of elapsed time
on the STAR-100, only one 250 member block of integration points
had been processed. The program had spent nearly four-fifths of 
its allocated central processor time swapping portions of data 
in and out to disk. 

This phenomenon - called thrashing - degrades machine per- 
formance to the extent that essentially no computational work 
can be done without a pagefault. Thrashing is usually the re- 
sult of data clash or lack of program locality. Data clash occurs 
when the data base is structured or accessed inefficiently. 
Algorithms which exhibit a high degree of program locality are 
designed to execute over a small area of code at a time without 
branching to parts of the program or using parts of a data base 
which must be paged into memory. STARSMART code was on small 
pages and was in a compact modular form which minimized the pro- 
gram locality problem. A study of the data transfer in the pro- 
gram was undertaken to determine the cause of the thrashing and 
to correct the problem. 

Three steps were then taken to decrease the amount of computer
time spent in paging. The source of the thrashing was
traced to the accesses of the precalculated Lorentz and Doppler 
halfwidth information for each line within the band pass. These 
values were stored in a 3-dimensional array ordered in the 
columnwise fashion used by FORTRAN. This ordering was changed 
to coincide with the rowwise data access scheme used by the SL/1 
compiler. This first modification reduced the paging time 
incurred during the profile setup calculations. 
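
The effect of storage order on paging can be sketched by counting the distinct pages touched while sweeping the last index of a 3-dimensional array. This is a Python illustration only; the array shape (gases x layers x lines) and the 512-word page size are assumptions made for the example, not values taken from the STARSMART source.

```python
def row_major_offset(i, j, k, dims):
    """Last index varies fastest (the rowwise order used by SL/1)."""
    ni, nj, nk = dims
    return (i * nj + j) * nk + k


def column_major_offset(i, j, k, dims):
    """First index varies fastest (the columnwise order used by FORTRAN)."""
    ni, nj, nk = dims
    return i + ni * (j + nj * k)


def pages_touched(offset_fn, dims, page_words):
    """Count distinct pages hit while sweeping the last index for i = j = 0."""
    nk = dims[2]
    return len({offset_fn(0, 0, k, dims) // page_words for k in range(nk)})


dims = (4, 15, 4666)     # assumed shape: gases x layers x lines
page_words = 512         # assumed small-page size in words
row_pages = pages_touched(row_major_offset, dims, page_words)
col_pages = pages_touched(column_major_offset, dims, page_words)
```

With these assumed sizes the same sweep touches 10 pages when the data are stored in the order they are accessed and 547 pages when they are not.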

Secondly, necessary parts of the profile information vectors 
were windowed in, that is brought in to memory, as subvectors 
stored on small pages for calculations on the current block of 
integration points. This technique guaranteed that any faults 
caused by paging current profile information in and out of central 
memory would be small page faults rather than the more time con- 
suming large page faults. 

The third modification required a change in the order of
summation over equations governing the calculation of the absorption
coefficients in order to implement the algorithm more
efficiently on the STAR-100.

The calculation of the absorption coefficients k_{ij}(w) at
integration point w for layer j and gas i is the sum of contributions
of all lines n within the interval about w, expressed as:

    k_{ij}(w) = \sum_{n} S_{ijn} \, \beta_{ijn}(w)

where S_{ijn} is the temperature corrected line intensity value and
\beta_{ijn}(w) is the line shape function value for line n within the
interval about w (Ref. 2).

In discretized vector form, this summation was originally
performed by a set of triple loops:

    k_{ijn}(w) = \sum_{n=1}^{NUMN} \sum_{j=1}^{NLAY} S_{nj}(w) \, \beta_{nj}(w) \sum_{i=1}^{NGAS} gc_i

where NLAY is the number of layers in the model, NUMN is the
number of included lines and NGAS is the number of active gases.
The vectors S_{nj} and \beta_{nj}, which contain all of the lines spanning
the interval about vector block w for all active gases, were
fetched unnecessarily deep in the loop nest.

Accesses to the profile information vectors S_{nj} and the
line shape vector \beta_{nj} in their entirety were made NUMN x NLAY
times per block. This same result could be obtained by the
following set of loops:

    k_{ijn}(w) = \sum_{j=1}^{NLAY} \sum_{n=1}^{NUMN} S_{nj}(w) \, \beta_{nj}(w) \sum_{i=1}^{NGAS} gc_i

with accesses made to the whole vectors on the order of only
NLAY per block.

The order of the two outer loops in the STARSMART code 
was reversed; this modification reduced the paging of profile
subvectors by a factor of NUMN. Since NUMN can be a large 
number in dense spectral regions, this third modification 
made the most significant contribution in reducing the thrashing 
problem. 
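
The saving from reversing the two outer loops can be sketched by counting whole-vector fetches for each loop order. This is an illustrative Python fragment with a hypothetical function name; it counts fetches only and does not perform the profile arithmetic.

```python
def fetch_counts(numn, nlay):
    """Count whole-vector fetches for the original and reordered loop nests."""
    original = 0
    for n in range(numn):          # lines outermost (original order)
        for j in range(nlay):      # layers inside
            original += 1          # profile vectors fetched at this depth

    reordered = 0
    for j in range(nlay):          # layers outermost (revised order)
        reordered += 1             # profile vectors fetched once per layer
        for n in range(numn):
            pass                   # per-line work reuses the resident vectors
    return original, reordered


original, reordered = fetch_counts(numn=4666, nlay=15)
reduction = original // reordered
```

For the 15 layer, 4666 line case the fetch count drops from 69990 to 15 per block, a reduction by a factor of NUMN.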

The previous modifications were incorporated and validated
on a small scale by running a case for methane (Table 2) to check 
for degradation caused by the windowing overhead. The modifica- 
tions were then tested on a large scale by the submittal of the 
15 layer midlatitude summer atmosphere case. The program ran to 
completion in less elapsed time than the initial algorithm took 
to calculate values for one block (Table 3).

SECTION 6 - MACHINE PRECISION AND ACCURACY


The advantages of the STAR-100 halfword arithmetic - 
increased speed and decreased data base size - are diminished 
to some degree by the loss in precision. To assess the effects 
of 32-bit arithmetic on the transmission calculations, the 
characteristics of the data base and the computational procedure 
were studied in depth. 

Despite its 64-bit wordlength, the STAR-100 has a full 
precision of only 14 significant decimal digits and a half 
precision of 6 or 7 digits. The CDC 6000 series machines achieve 
single precision results with 14 or 15 significant digits. Using 
STAR halfword arithmetic with estimated speedup factors of 2 
and 4 for addition and multiplication and the corresponding 
reduction by 2 of the operand data base could lead to a more 
cost effective use of the machine. The tradeoffs between accuracy 
and economy were taken into consideration. 

Errors during floating point calculations on a digital machine 
arise from several sources. The first source is the machine 
representation of any number having a fractional part not an 
integer power of the machine base. Such representations will 
have a maximum error of \frac{1}{2}\beta^{1-t}, where \beta is the base and t the
number of bits in the fractional part of the word. This number,
the machine epsilon, is directly related to machine precision,
and on the STAR-100 the halfword value is on the order of 1.0E-7.
Another source of error is roundoff committed during the operand 
normalization and actual machine arithmetic. Errors arising 
at each step of a calculation can be bounded by the machine 
epsilon, but these round-off errors will propagate. Calculations
for the solution of problems by unstable algorithms can lead to large 
relative errors, inaccurate solutions, and at worst, loss of 
significance. Another source of error related to precision is 
loss of significance caused by use of operands out of the range 
of machine representation or intrinsic system routines such as 
sine and cosine. These possible sources of error in the STARSMART 
code were studied and conditions leading to unacceptable levels 
of error were corrected. 

The error associated with machine representation is well
below the threshold of certainty of the input parameters for
STARSMART. The spectral line parameters read in from the McClatchey
AFCRL tape (Ref. 9) are not known to full machine significance.
The value of line location is known to ±0.05 cm-1, while values
of ground state energy and halfwidth are known only to ±5-10%.
The use of half precision did not cause loss of significance
during the input of the data base.

The round-off error caused by 32-bit arithmetic used in 
such functions as the Voigt profile was determined to be acceptable 
by parallel runs of STAR-100 and 6000 series algorithms. 

The exponent range of 32-bit operands was considered because
loss of significance can occur during floating point arithmetic.
On the STAR-100 a floating point number is represented as c * 2**E
where E is a valid signed exponent and c is a valid signed
coefficient. Values are kept within the machine range by shifting
the binary point and adjusting the exponent. The implicit dependence
of the range of E and c upon the precision of the machine can
be expressed by the following:

    x = \pm \left( a_1 \beta^{-1} + a_2 \beta^{-2} + \cdots + a_t \beta^{-t} \right) \beta^{E}

characterized by the number base \beta, precision t and an exponent
range (l,u). The integers a_i (i = 1 to t) lie within 0 and \beta-1
and l \le E \le u (Ref. 8). When E exceeds these bounds or c vanishes
during a calculation not resulting in a true zero, a loss of
significance is said to occur and no digits in the computed result
can be trusted.

The STAR-100 halfword exponent and coefficient bounds allow
an operand range of ±2.1e+40 to ±8.1e-28. Any numbers outside
of this range cannot be represented by the machine. It was
discovered that the line strengths for some of the weaker spectral
lines processed by STARSMART are on the order of 1.0e-30. Thus
all contributions from these lines were truncated to zero and the
calculated absorption coefficients falsely indicated no absorbers
in the neighborhood of weaker lines.

The transmission \tau at wavenumber w and layer l is defined:

    \tau_{w,l} = e^{-OP_l \, k_i(w) \, gc_i}          (Ref. 2)

where OP_l is the optical path in molecules/cm2, k_i(w) the absorption
coefficient for gas i and gc_i the fractional gas concentration. The
values of the absorption coefficients were on the order of 1.0e-30 and
the values of the optical path about 1.0e+24. To preserve significance
the optical paths were postmultiplied by a value of 1.0e-24
and the line strengths were premultiplied by a factor of 1.0e+24.
These operations algebraically cancel in the preceding equation
and keep the operands within the range of 32-bit operations.
Comparisons between NOS and STAR-100 absorption coefficients
and transmissions revealed a relative error at specific
wavenumbers of 0.30% for the improved algorithm versus 7.78%
for the original problem (Table 4).
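
The effect of the rescaling can be sketched with a simple range check. The Python fragment below models only the halfword magnitude limits quoted in this section (not the STAR-100 coefficient precision), uses a hypothetical helper name, and takes the orders of magnitude 1.0e-30 and 1.0e+24 from the text.

```python
import math

MIN_MAGNITUDE = 8.1e-28      # smallest halfword magnitude quoted in the text
MAX_MAGNITUDE = 2.1e40       # largest halfword magnitude quoted in the text


def to_halfword_range(x):
    """Flush values below the representable range to zero (range check only)."""
    if x != 0.0 and abs(x) < MIN_MAGNITUDE:
        return 0.0
    if abs(x) > MAX_MAGNITUDE:
        raise OverflowError("operand exceeds halfword range")
    return x


absorption = 1.0e-30         # order of magnitude of a weak-line coefficient
optical_path = 1.0e24        # molecules/cm2, order of magnitude

# Unscaled: the coefficient is truncated to zero before the product is formed.
tau_unscaled = math.exp(-to_halfword_range(optical_path)
                        * to_halfword_range(absorption))

# Scaled: the factors 1e-24 and 1e+24 cancel algebraically in the product.
tau_scaled = math.exp(-to_halfword_range(optical_path * 1.0e-24)
                      * to_halfword_range(absorption * 1.0e24))
```

In this sketch the unscaled form returns a transmission of exactly 1.0 (no absorber), while the scaled form retains the small but nonzero absorption.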

Another possible source of error in the STARSMART calculations 
considered was that of invalid arguments supplied to intrinsic 
functions. Such functions as natural log (LN) and exponential (EXP)
have narrower ranges and the use of out-of-range operands will 
cause unpredictable and unflagged results. Code to perform validity 
checking of operands and to issue diagnostics was implemented in 
the revised program. No current runs have caused any error diagnos- 
tics to be issued. 
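
A minimal sketch of this kind of operand checking is shown below in Python. The wrapper names are hypothetical and the EXP argument bounds are an assumption, derived here from the halfword operand range quoted earlier in this section rather than from the SL/1 library documentation.

```python
import math

# Assumed bounds: the EXP result must stay inside the halfword operand range.
EXP_MAX_ARG = math.log(2.1e40)
EXP_MIN_ARG = math.log(8.1e-28)


def checked_ln(x):
    """Natural log with an explicit diagnostic for invalid operands."""
    if x <= 0.0:
        raise ValueError(f"LN argument out of range: {x}")
    return math.log(x)


def checked_exp(x):
    """Exponential with an explicit diagnostic for out-of-range operands."""
    if not (EXP_MIN_ARG <= x <= EXP_MAX_ARG):
        raise ValueError(f"EXP argument out of range: {x}")
    return math.exp(x)
```

Raising an exception stands in for the diagnostics issued by the revised program; the point is that an out-of-range operand is flagged rather than silently producing an unpredictable result.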

On the basis of the above findings, the halfword precision 
was kept in the STARSMART program. The only operations still per- 
formed in full precision are the I/O transfers which must be handled 
by the 64-bit operating system routines. 

SECTION 7 - I/O HANDLING


The STAR-100 is a powerful computing machine, but its input/ 
output handling capabilities are quite crude in comparison to those 
of the general purpose user oriented NOS serial machines. At times 
it is more expensive to store results because of costly I/O over- 
head than it is to recompute them. However, graphical display was 
one of the required products of the instrument simulation package 
and there is no plotting capability available on the STAR-100. 

Thus absorption coefficients would have to be stored for post- 
processing. For realistic atmospheres, the number of coefficients 
approaches one million and the storage, manipulation and conversion 
of those values have a large impact on the overall efficiency of any 
STAR-100 program. 

The STAR-100 supports two types of I/O operations: explicit
and implicit. Explicit I/O is the more familiar data transfer
which is activated by READ and WRITE statements. Implicit I/O 
is data transfer which occurs without specific commands. Examples 
are the paging caused by hardware translation to get data or code 
during execution and the mapping of files. 

Explicit I/O can be performed in two modes: formatted (coded)
and unformatted (binary). Coded data transfer uses more space
to represent numbers and requires conversion from the machine 
floating point representation to external ASCII. Binary data takes 
less time to process and has a higher density; however, an extra 
step is required to pass binary from STAR to the CYBER machines. 
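
The density difference between the two modes can be illustrated in Python. The values below are made-up stand-ins for absorption coefficients, the text field width is an assumed E15.7-style format, and the binary form uses IEEE 32-bit floats rather than the STAR-100 halfword format.

```python
import struct

# Stand-in values; not taken from any STARSMART run.
values = [1.0e-6 * (i + 1) for i in range(1000)]

# Formatted (coded): one 15-character text field per value.
coded = "".join(f"{v:15.7E}" for v in values).encode("ascii")

# Unformatted (binary): packed 32-bit floats, no conversion to text.
binary = struct.pack(f"<{len(values)}f", *values)

coded_bytes = len(coded)
binary_bytes = len(binary)
```

Under these assumptions the coded form needs 15000 bytes against 4000 for the binary form, before any conversion cost is counted.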

STARSMART initially had no I/O capability except that of
listing results on the printer. During the first phase, the 
absorption coefficients were written out to a temporary disk 
file by binary WRITE commands and then converted to CDC 60-bit 
floating point and passed over the data channel to the NOS machines 
for graphical post-processing. For small cases this worked quite 
well; but for the large cases the operating system aborted the 
job because of output file overflow. The overhead caused by 
explicit I/O operations forced the system to extend file lengths 
to the maximum. The impossibility of reducing the size of the 
output file led to the development of an implicit I/O strategy. 

Use of implicit I/O requires that the user become very familiar 
with the workings of the operating system and understand his file 
needs. The user must be responsible not only for determining the 
file size, but also for organizing the data and selecting the 
blocksize best suited to the problem. Implicit I/O avoids the 
file overhead caused by multiple entries of data and extra trans- 
fers through buffers which occur during explicit I/O. STARSMART 
was modified to invoke SL/1 system function statements which map 
I/O files to specific contiguous disk locations reserved by loader 
options. The READ and WRITE statements were replaced with references 
to locations within the data area which is mapped to a user file. 
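
A present-day analogue of this strategy is a memory-mapped file. The Python sketch below is not the SL/1 mechanism; it only shows the same idea of reserving the file size in advance and then storing into the mapped region instead of issuing WRITE calls. The file name, value count and 32-bit float record are assumptions.

```python
import mmap
import os
import struct
import tempfile

N_VALUES = 1000                    # assumed number of coefficients
RECORD = struct.Struct("<f")       # 32-bit float stand-in for a halfword value

path = os.path.join(tempfile.mkdtemp(), "coeffs.bin")   # hypothetical file
with open(path, "wb") as f:
    f.truncate(N_VALUES * RECORD.size)   # reserve the full file size up front

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:
        # Stores into the mapped region take the place of explicit WRITEs.
        for i in range(N_VALUES):
            RECORD.pack_into(mapped, i * RECORD.size, float(i))
        mapped.flush()

with open(path, "rb") as f:
    last_value = RECORD.unpack(f.read()[-RECORD.size:])[0]
file_size = os.path.getsize(path)
```

As in the text, the user fixes the file size and record layout ahead of time; the operating system's paging then moves the data to disk.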

Several data structures to represent the absorption coefficients 
generated by a problem with NLAY layers, NGAS gases and NWAVE inte- 
gration points were tested. The most obvious structure, that of a 
3-dimensional array of dimension NLAY x NGAS x NWAVE indexed by layer 
number, gas number and wavenumber, was not the most efficient
because the references to the structure were not over optimal
length vectors. The final structure used was a 3-dimensional
array of dimension (NWAVE/blocksize) x (NGAS x NLAY) x blocksize,
which was selected after small scale tests were run (Table 2).
Since this representation referenced the optimal length blocksize
vectors through the gases and layers in a rowwise order, the
amount of pagefaulting was reduced.
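
The blocked layout can be sketched as an index calculation. In the Python fragment below the function name is hypothetical and the ordering of the combined gas and layer index is an assumption; the source gives only the overall dimensions.

```python
def blocked_offset(layer, gas, point, n_layers, n_gases, block_size):
    """Offset in a (NWAVE/blocksize) x (NGAS x NLAY) x blocksize array."""
    block, within = divmod(point, block_size)
    row = gas * n_layers + layer        # assumed ordering of the middle index
    return (block * (n_gases * n_layers) + row) * block_size + within


n_layers, n_gases, block_size = 15, 4, 250
first = blocked_offset(0, 0, 0, n_layers, n_gases, block_size)
next_point = blocked_offset(0, 0, 1, n_layers, n_gases, block_size)
next_layer = blocked_offset(1, 0, 0, n_layers, n_gases, block_size)
next_block = blocked_offset(0, 0, 250, n_layers, n_gases, block_size)
```

The points of one block for one gas and layer are contiguous, so each reference is to a full blocksize vector, and all gas and layer vectors for a block sit together before the next block begins.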

Although the mapped file had to be processed by an SL/1 
routine which placed it in binary format for system conversion,
the cost of post-processing the mapped file was small in comparison 
to the overhead used to generate the large binary file during 
calculation. 

SECTION 8 - RESULTS


Optimization of STARSMART code has reduced the time and 
cost of runs to calculate absorption coefficients for large 
atmospheric models. For example, a 15-layer atmospheric model 
spanning the 140 cm-1 bandpass of the Pressure Modulated Radio- 
meter now executes 56 times faster than when processed by the 
original STAR-100 code. Indications of better utilization of 
the powerful STAR-100 central processing unit can be seen in 
the system activity summaries at the end of user printouts. 

The number of large and small page faults and the percent of 
CPU time spent in paging have been reduced significantly. 

Elapsed clock time and turnaround time also were reduced. 
Results which formerly required 2 to 3 days to process now 
return to the user in less than a day. 

From a software design viewpoint, the STARSMART code is a 
considerable improvement over the initial codes. STARSMART takes 
advantage of the features offered by the SL/1 compiler to present 
clear formatted source in a well-structured modular form. The 
"ruggedized" code prevents executions with mismatched parameters 
and issues diagnostics to flag potential problems. 

SECTION 9 - CONCLUSIONS


Proper utilization of the STAR-100 has resulted in a gen- 
eralized program to calculate and store absorption coefficients 
which executes much faster than the original STAR-100 code and 
faster still than the comparable CDC 6000 series code. The steps
taken during the adaptation and optimization of the code have been
described. 

Although virtual memory remedied the problems of core limita- 
tion experienced on the serial machines, it was not a panacea. A 
straightforward translation of an efficient serial code often per- 
forms poorly on the vector machine because the designer has neglected 
to consider the side-effects of virtual memory on vectorization and 
data structure. 

In the case of STARSMART, operations taken for granted on a 
serial machine, such as I/O and array accessing, caused an inordinate 
amount of paging on the STAR-100. To overcome these adverse effects 
a study on the impact of virtual memory had to be undertaken. 

Not all problems are suited to the use of halfword arithmetic. 
Thus, the equations describing the model, as well as the input 
parameters, were studied carefully during the evaluation of 
halfword arithmetic. Fortunately the STARSMART code could use 
halfword arithmetic and take advantage of the resulting speedups in 
operations and reduction in data base size. 
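
The kind of check involved in evaluating halfword arithmetic can be 
sketched in Python. This is an illustration only: IEEE 32-bit 
rounding is used as a stand-in for the STAR-100 halfword format, 
which differs in detail, and the function names and the two optical 
depths are hypothetical. 

```python
import math
import struct


def to_single(x):
    """Round a 64-bit Python float to the nearest IEEE 32-bit value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def absorption(tau, half_precision):
    """Return 1 - exp(-tau), optionally rounding each step to 32 bits."""
    if not half_precision:
        return 1.0 - math.exp(-tau)
    transmission = to_single(math.exp(-to_single(tau)))
    return to_single(1.0 - transmission)


weak_line = 1.0e-8   # hypothetical optical depth of a very weak line
strong_line = 0.5    # hypothetical optical depth of a strong line

for tau in (weak_line, strong_line):
    full = absorption(tau, half_precision=False)
    half = absorption(tau, half_precision=True)
    print(f"tau={tau:g}  64-bit={full:.10g}  32-bit={half:.10g}")
```

For the strong line the two precisions agree to about seven digits. 
For the very weak line the 32-bit transmission rounds to exactly 1, 
so the absorption is lost entirely. Inputs of this kind are the ones 
that need to be examined before committing to the shorter word. 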

The selection of SL/1 resulted in faster debugging and checkout, 
as well as faster execution time than could have been provided 
by the other available languages. SL/1 provides the control 
structures necessary for clear, high-quality code. STARSMART is easier 
to read and modify than the original codes; this additional clarity 
makes the program a better tool for other researchers in this area 
of atmospheric work. 

The STAR-100 code shows an improvement in time and cost over 
the serial code when used for larger problems. The greatest 
advantages are gained in regions of dense spectral lines, and it 
is precisely these cases that cannot be run at all on the serial 
machines. Thus STARSMART has expanded the range of user-definable 
models which can be realized on the available machines. 
TABLE 1 


EFFECTS OF PRECISION AND COMPILER OPTIMIZATION 
UPON COMPUTATIONAL SPEED AND STORAGE 
OF SELECTED APPROXIMATIONS OF THE VOIGT PROFILE 


                              TIMING           STORAGE (WORDS) 
ALGORITHM                     (CPU SECONDS)    CODE    DATABASE 

Drayson (1)                   0.1669           --      -- 
Pierluissi (1)                0.1568           --      -- 
Pierluissi, 64-Bit Math       0.0560           258     1325 
  W/Check Option (2)          0.0683           726     1325 
Pierluissi, 32-Bit Math       0.0558           251     843 
  W/Check Option (2)          0.0729           762     343 
Pierluissi, 32-Bit Math,      0.0568           254     843 
  Generalized Regions 


(1) 101 multiple calls to scalar routines, as compared to the 
    succeeding cases, which were performed as single calls to 
    process vectors of length |v| = 101. NOS scalar timing for 
    both methods was 0.05 seconds. 

(2) Extra runtime code generated by the compiler for diagnostic 
    purposes caused timing and storage overhead. 



TABLE 2 


EFFECTS OF PAGING AND I/O MANAGEMENT 
UPON A SIMPLE ATMOSPHERIC MODEL 


METHANE CASE 

GASES = 1    LAYERS = 1    SPECTRAL LINES = 740 
MAXIMUM DENSITY = 84    4200 - 4490 CM-1 BANDPASS 
INTEGRATION POINTS = 7250 


                                       PAGE FAULTS     % OF CRU'S 
ALGORITHM                    CRU'S     LARGE   SMALL   SPENT PAGING 

ORIGINAL ALGORITHM           26.04         4      22    4.46 
  (3-D ARRAY, EXPLICIT I/O) 
3-D ARRAY, IMPLICIT I/O,     49.50       291     116   56.5 
  LARGE PAGES 
3-D ARRAY (1), IMPLICIT      13.45         4     911   14 
  I/O, SMALL PAGES 
2-D ARRAY (2), IMPLICIT      12.95         6     117   11 
  I/O, SMALL PAGES 


(1) Selected data structure. 

(2) This data structure was not selected because the number of 
    absorption coefficients calculated by large models generally 
    exceeds the maximum vector length of 65,534 enforced by SL/1. 


TABLE 3 


RESULTS OF OPTIMIZATION OF 
STARSMART CODE 


CASE I. MIDLATITUDE SUMMER ATMOSPHERE (SMALL) 

GASES = 4    LAYERS = 15    SPECTRAL LINES = 211 
INTEGRATION POINTS = 2500    BANDPASS 2000-2025 cm-1 
INCREMENT = 0.01 cm-1 

                                   PAGE FAULTS      % CRU'S 
                      CRU'S (1)    LARGE    SMALL   PAGING 

ORIGINAL ALGORITHM    15.774           6      820   10.7 
IMPROVED ALGORITHM    14.412           3      388   (illegible) 


CASE II. MIDLATITUDE SUMMER ATMOSPHERE (LARGE) 

GASES = 4    LAYERS = 15    SPECTRAL LINES = 4666 
INTEGRATION POINTS = 14000    BANDPASS 2075-2215 cm-1 
INCREMENT = 0.01 cm-1 
MAXIMUM DENSITY OF SPECTRAL LINES = 645 

                                   PAGE FAULTS      % CRU'S 
                          CRU'S    LARGE    SMALL   PAGING 

ORIGINAL ALGORITHM (2)    6751     42840    39984   81.8 
IMPROVED ALGORITHM         151       698     3356   61.8 
FINE TUNED ALGORITHM (3)   105       350     2540   30.9 


(1) CRU is the standard Computer Resource Unit used for accounting 
    purposes. 

(2) This algorithm only processed 1/56 of the desired bandpass. 
    The extrapolated cost is 6751 CRU's. 

(3) The array dimensions were set very close to the minimums required 
    for the bandpass. 


TABLE 4 


PERCENT RELATIVE ERROR FOR TRANSMISSIONS 
AT INTEGRATION POINTS 


E(1) = |T_l - T_l^(1)| / T_l x 100 
E(2) = |T_l - T_l^(2)| / T_l x 100 

             w = 2077            w = 2075 
LAYER      E(1)     E(2)       E(1)     E(2) 

  1        1.07     0          0.69     0.50 
  2        1.87     0          1.01     0.50 
  3        2.6      0          1.50     0.70 
  4        3.03     0          2.00     0.80 
  5        3.88     0.15       2.52     1.06 
  6        3.96     0.25       2.87     1.07 
  7        4.5      0.41       3.07     1.24 
  8        5.0      0.69       3.24     1.24 
  9        5.89     0.12       3.41     1.25 
 10        6.32     0.20       3.53     1.25 
 11        6.90     0.34       3.72     1.25 
 12        7.08     0.23       3.91     1.25 
 13        7.36     0.27       3.91     1.25 
 14        7.55     0.37       4.07     1.09 
 15        7.78     0.30       4.07     1.09 


T_l^(1) is total transmission at top of layer l for original 
algorithm. 

T_l^(2) is total transmission at top of layer l for improved algorithm. 
T_l is total transmission at top of layer l for serial algorithm. 
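
The error measure tabulated in Table 4 compares each STAR-100 
transmission against the serial result for the same layer. A 
minimal Python sketch of that measure follows. The function name 
and the sample transmissions are hypothetical and are not values 
from the report. 

```python
def percent_relative_error(serial, candidate):
    """Return |serial - candidate| / |serial| * 100.

    'serial' is the reference transmission from the serial algorithm;
    'candidate' is the transmission from the algorithm being checked.
    """
    if serial == 0:
        raise ValueError("reference transmission must be nonzero")
    return abs(serial - candidate) / abs(serial) * 100.0


# Hypothetical transmissions at the top of three layers.
serial_values = [0.90, 0.80, 0.50]
candidate_values = [0.90, 0.79, 0.51]

for layer, (t_ref, t_new) in enumerate(zip(serial_values, candidate_values), 1):
    error = percent_relative_error(t_ref, t_new)
    print(f"layer {layer}: {error:.2f} percent")
```

Dividing by the serial value makes the measure independent of the 
magnitude of the transmission, so errors can be compared from layer 
to layer. 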



TABLE 5 


COMPARISON OF STAR-100 AND 
CDC 6000 SERIES EXECUTION 
OF RADIATIVE TRANSFER CODE 


CASE I. 15 LAYER MIDLATITUDE SUMMER ATMOSPHERE 
        4 GASES    2075-2175 CM-1    (Ref. 5) 

                            CDC        STAR-100 

CPU SECS.                   2898.8     29.2 
COST FACTOR (CDC/STAR)      7 
CPU FACTOR (CDC/STAR)       99 


CASE II. (1) 6 LAYER N2O CASE    2075-2215 CM-1 

                            CDC        STAR-100 

CPU SECS.                   5122       92.6 
CLOCK TIME                  3 HOURS    1.5 MINUTES 
TURNAROUND TIME             2 DAYS     OVERNIGHT 
COST FACTOR (CDC/STAR)      1.69 
CPU FACTOR (CDC/STAR)       55 


(1) The cost and CPU factors for CASE II are lower because 
    the case deals with a region of denser spectral lines 
    and incurs the overhead of storing the calculated absorption 
    coefficients. 
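
The CPU factors in Table 5 follow directly from the tabulated CPU 
seconds. The short Python check below reproduces them; the function 
and variable names are illustrative only. 

```python
def cpu_factor(cdc_seconds, star_seconds):
    """Return the ratio of CDC CPU time to STAR-100 CPU time."""
    return cdc_seconds / star_seconds


# CPU seconds as listed in Table 5.
case_1 = cpu_factor(2898.8, 29.2)
case_2 = cpu_factor(5122.0, 92.6)

print(f"CASE I  CPU factor: {case_1:.1f}")
print(f"CASE II CPU factor: {case_2:.1f}")
```

Rounded to the nearest integer, the ratios give the factors of 99 
and 55 listed in the table. 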